FEAT Add response_parser hook to SelfAskTrueFalseScorer with LlamaGuard support #1867
immu4989 wants to merge 2 commits into
…rd support Per the design discussion in microsoft#1830, extend SelfAskTrueFalseScorer with an optional response_parser callable so the same scorer can wrap fine-tuned safety classifiers (LlamaGuard, ShieldGemma, WildGuard, HarmBench-paper) whose output is not JSON. Default behavior is unchanged. Ships a parse_llamaguard_response helper plus YAML assets (TrueFalseQuestion and system prompt) so users can drop in any LlamaGuard-serving endpoint via PromptChatTarget. No local transformers or torch dependency. Also fixes a latent typing issue in Scorer._score_value_with_llm: score_value_description now defaults to '' when the response omits the description field, instead of being None against a str-typed field.
romanlutz left a comment
I don't have a llama-guard deployment and can't test this. Can you confirm that you did test it?
| @@ -0,0 +1,18 @@ | |||
| category: llamaguard | |||
This YAML is added under pyrit/datasets/score/true_false_question/ but it's never referenced anywhere in the code: there's no TrueFalseQuestionPaths.LLAMAGUARD enum entry, no usage in the new tests, and the parser docstring doesn't mention it. Users following the integration tests as the example will construct a TrueFalseQuestion inline and never discover this file.
Same comment applies to llamaguard_system_prompt.yaml — it's not wired into anything either.
I'd suggest wiring them in: add a TrueFalseQuestionPaths.LLAMAGUARD enum value pointing at this file, and reference the system-prompt path from the parser's docstring (or expose it as a module-level constant alongside parse_llamaguard_response). That's the user-discoverable path.
| parameters: | ||
| - true_description | ||
| - false_description | ||
| - metadata |
parameters declares true_description, false_description, and metadata, but the value: template below is fully static — none of these are referenced via {{ ... }}. render_template_value happily ignores extra kwargs, so this won't fail at runtime, but the declaration is misleading: someone editing the prompt later will assume the descriptions are interpolated and that overrides via true_false_question flow into the prompt. With LlamaGuard they don't (and shouldn't — the classifier ignores prompt-embedded categories anyway).
Either drop the parameters list, or actually reference the variables in the template if you want overrides to take effect.
| response text. Must return a dict containing at least ``score_value_output_key`` | ||
| and ``rationale_output_key``; may also include ``description_output_key``, | ||
| ``metadata_output_key``, and ``category_output_key``. Should raise | ||
| :class:`InvalidJsonException` on malformed output so the ``@pyrit_json_retry`` |
:class:`InvalidJsonException` is reStructuredText cross-reference syntax. PyRIT's docs build uses MyST (Markdown-flavoured), so this renders as literal text (the reST role isn't interpreted) instead of a cross-reference. Convention in this codebase is plain double-backticks for symbol names.
| :class:`InvalidJsonException` on malformed output so the ``@pyrit_json_retry`` | |
| ``InvalidJsonException`` on malformed output so the ``@pyrit_json_retry`` |
Same issue in pyrit/score/true_false/self_ask_true_false_scorer.py line 133: change :class:`pyrit.exceptions.InvalidJsonException` to ``InvalidJsonException`` (or ``pyrit.exceptions.InvalidJsonException`` if you want the fully-qualified name).
| "LikertScaleEvalFiles", | ||
| "LikertScalePaths", | ||
| "MarkdownInjectionScorer", | ||
| "parse_llamaguard_response", |
alphabetical order please
| Defaults to "category". | ||
| attack_identifier (Optional[ComponentIdentifier]): The attack identifier. | ||
| Defaults to None. | ||
| response_parser (Optional[Callable[[str], dict[str, Any]]]): Custom parser for |
Scorer needn't be LLM-based so I think we don't want it at this level. One could argue we should consider how inheritance/interfaces work here but that's a bit out of scope.
Fixes #1830.
Implements the parser-pluggable approach @romanlutz approved in #1830.
SelfAskTrueFalseScorer gains a response_parser hook so the same scorer can wrap fine-tuned classifiers like LlamaGuard whose output is not JSON. This avoids needing a new scorer class for every safety classifier and gives PyRIT a place to land ShieldGemma, WildGuard, and the HarmBench-paper classifier later without reinventing the abstraction.
Why a parser hook
SelfAskTrueFalseScorer's system prompt (true_false_system_prompt.yaml) instructs the scorer LLM to emit a JSON object with score_value, description, and rationale. Scorer._score_value_with_llm parses that JSON. The contract works for a general instruction-following LLM but breaks for LlamaGuard, which is a fine-tuned classifier whose output is hard-coded to "safe" or "unsafe\n<comma-separated category codes>". LlamaGuard ignores any "respond as JSON" instruction because that format is not part of its training. A parser override is required.
Changes
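In outline, the hook is an optional callable that takes over the step that turns raw response text into a dict. The following is a minimal standalone sketch of that idea, not the PR's code: parse_scorer_response and verdict_parser are hypothetical names, the dict keys are assumed, and the markdown-stripping that the real default path performs is omitted.

```python
import json
from typing import Any, Callable, Optional


def parse_scorer_response(
    text: str,
    response_parser: Optional[Callable[[str], dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Hypothetical stand-in for the parsing step inside the scorer."""
    if response_parser is not None:
        # Custom path: the caller decides how raw text becomes a dict.
        return response_parser(text)
    # Default path: expect a JSON object, as the stock system prompt requests.
    return json.loads(text)


def verdict_parser(text: str) -> dict[str, Any]:
    """Toy parser for a classifier that answers with a bare verdict line."""
    cleaned = text.strip()
    is_unsafe = cleaned.lower().startswith("unsafe")
    # Scores are carried as the strings "True" / "False" (assumed convention).
    return {"score_value": str(is_unsafe), "rationale": cleaned}


# A JSON-emitting LLM goes through the default path.
print(parse_scorer_response('{"score_value": "True", "rationale": "r"}'))
# A non-JSON classifier needs the override.
print(parse_scorer_response("unsafe\nS1", response_parser=verdict_parser))
```

Calling parse_scorer_response("safe") without a parser raises a JSON decode error, which is the failure mode that motivates the override.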
In pyrit/score/scorer.py, Scorer._score_value_with_llm gains an optional response_parser: Callable[[str], dict[str, Any]] kwarg. When provided, it replaces the default json.loads(remove_markdown_json(...)) step. Default behavior is unchanged. The edit also fixes a latent typing issue surfaced by stricter inference: score_value_description now defaults to "" when missing from the response.
SelfAskTrueFalseScorer (in pyrit/score/true_false/self_ask_true_false_scorer.py) gets a matching response_parser kwarg and threads it through to _score_value_with_llm. Existing callers see no change.
A new helper at pyrit/score/true_false/llamaguard_parser.py provides parse_llamaguard_response(text). It maps "safe" to score_value="False" and "unsafe\n<categories>" to score_value="True" with the violated category codes placed on score_metadata["violated_categories"]. On malformed output it raises InvalidJsonException so @pyrit_json_retry retries the LLM call.
Two new YAML assets ship under pyrit/datasets/score/true_false_question/:
llamaguard.yaml: a TrueFalseQuestion covering the MLCommons safety taxonomy (S1-S14) for the llamaguard category.
llamaguard_system_prompt.yaml: a system prompt template that fits PyRIT's system-prompt + user-message contract. The header documents that users wanting strict fidelity to the official Meta chat template can override via true_false_system_prompt_path.
pyrit/score/__init__.py exports parse_llamaguard_response.
Usage
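As a rough illustration of the parser contract described under Changes, here is a standalone sketch that does not import PyRIT. It is not the helper shipped in this PR: parse_llamaguard_sketch is a hypothetical name, the exception class is a local stand-in for pyrit.exceptions.InvalidJsonException, the dict keys follow the wording of the description, and treating a bare "unsafe" as an empty category list is an assumption.

```python
from typing import Any


class InvalidJsonException(Exception):
    """Local stand-in for pyrit.exceptions.InvalidJsonException."""


def parse_llamaguard_sketch(text: str) -> dict[str, Any]:
    """Map LlamaGuard-style output to a score dict (illustrative only)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    verdict = lines[0].lower() if lines else ""

    if verdict == "safe":
        return {
            "score_value": "False",
            "rationale": "LlamaGuard verdict: safe",
            "score_metadata": {"violated_categories": []},
        }

    if verdict == "unsafe":
        # Second line, if present, holds comma-separated category codes.
        categories: list[str] = []
        if len(lines) > 1:
            categories = [c.strip() for c in lines[1].split(",") if c.strip()]
        return {
            "score_value": "True",
            "rationale": "LlamaGuard verdict: unsafe",
            "score_metadata": {"violated_categories": categories},
        }

    # Anything else (empty text, a refusal, a garbled verdict) is malformed;
    # raising lets a retry decorator re-issue the LLM call.
    raise InvalidJsonException(f"Unexpected LlamaGuard output: {text!r}")


print(parse_llamaguard_sketch("Safe"))
print(parse_llamaguard_sketch("unsafe\nS1, S10"))
```

With the real helper, the scorer would be constructed with response_parser=parse_llamaguard_response, as the tests in this PR do.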
Works with HuggingFace Inference, Together, Groq, Fireworks, a local vLLM/TGI, or any OpenAI-compatible endpoint serving Llama-Guard-3-8B, LlamaGuard-7B, or Llama-Guard-3-1B. No local transformers or torch dependency.
Tests
The new file tests/unit/score/test_llamaguard_parser.py contains 15 tests.
safe, mixed-case Safe, whitespace, unsafe with single, multiple, missing, and empty category lines, plus empty input, a refusal string, and a malformed verdict.
SelfAskTrueFalseScorer with response_parser=parse_llamaguard_response against a mocked target, for both safe and unsafe-with-categories paths.
… response_parser keeps the JSON parsing path.
Verification
Out of scope for this PR
Three natural follow-ons that fit the pattern introduced here:
… response_parser plumbing.
… Llama-Guard-3-11B-Vision.